ABSTRACT


Multicomputers are message passing based
multiprocessor systems in which the processing units operate
asynchronously, each under the control of its own local
controller. Hence a problem that has an arbitrarily
structured parallelism can be programmed with much
flexibility on multicomputers. In this thesis, a general
purpose simulator is developed to provide a test bed for
developing and testing concurrent algorithms for
multicomputer architectures. It is implemented in three
layers. The first layer implements process creation and
interprocessor communication to simulate a single processing
element. The second layer, built over the first, is specific
to a particular class of multicomputers and provides a better
user interface. The user program forms the third layer and is
written by calling the primitives provided by the second
layer. The package provides facilities to simulate both point
to point and broadcast communication multicomputer
architectures.



TABLE OF CONTENTS

1. INTRODUCTION
1.1 Motivation
1.2 Contributions and outline of thesis
1.3 Thesis Organisation

2. DESIGN OF SIMULATOR
2.1 Introduction
2.2 Simulator Core
2.3 Structure Core
2.4 User Program
2.5 Conclusion

3. IMPLEMENTATION OF SIMULATOR
3.1 Introduction
3.2 Simulator Core
3.3 Structure Core
3.4 Conclusion

4. EXAMPLES
4.1 Introduction
4.2 Structure Interface to Hypercube
4.3 Structure Interface to MMS
4.4 Structure Interface to 2_d Mesh
4.5 Structure Interface to Binary tree
4.6 Structure Interface to Broadcast Hypercube
4.7 Conclusion

5. SIMULATION RUNS
5.1 Introduction
5.2 Summation of numbers on Hypercube
5.3 Summation of numbers on 2_d Mesh
5.4 Matrix_by_Vector multiplication on Tree
5.5 Conclusion

6. CONCLUSIONS
6.1 Debugger
6.2 Performance Monitor
6.3 Shortcomings

APPENDIX A

APPENDIX B

APPENDIX C

REFERENCES




LIST OF FIGURES

1. Hypercube of dimension 2 (fig 3.1)
2. Naive algorithm for Opening Pipes (fig 3.2)
3. Deadlock Avoidance algorithm (fig 3.3)
4. Hypercube structure (fig 4.1)
5. MMS structure (fig 4.2)
6. 2_d Mesh (fig 4.3)
7. Binary tree (fig 4.4)
8. Broadcast Hypercube (fig 4.5)
9. Summation of numbers on Hypercube (illus., fig 5.2.1)
10. Summation of numbers on Hypercube (User program, fig 5.2.2)
11. Summation of numbers on 2_d Mesh (illus., fig 5.3.1)
12. Summation of numbers on 2_d Mesh (User program, fig 5.3.2)
13. Matrix_by_Vector multiplication on Tree (illus., fig 5.4.1)
14. Matrix_by_Vector multiplication on Tree (User program, fig 5.4.2)



CHAPTER 1: INTRODUCTION 


Multiple processor systems are now being increasingly
used for high speed computations. They can be broadly
divided into two categories - the shared memory systems
(multiprocessors) and the message passing based systems
(multicomputers). In the message passing type of
multicomputer system, processing units operate asynchronously
under the control of a local controller, one for each
processing unit [Reed87]. A problem that has an arbitrarily
structured parallelism can be programmed with much
flexibility on a multicomputer system. An effort towards
developing a parallel programming environment
for multicomputer systems led us to the development of this
package, a general purpose simulator for multicomputer
architectures. This simulator can be used in the single
program multiple data mode when all processors execute the
same function, or in the multiple program multiple data mode
when they do not.

1.1 MOTIVATION 

The motivation for this work is to build a parallel
programming environment for simulating multicomputer
architectures. We do not know of any implementation which
takes a comprehensive approach to simulation. More precisely,
the implementations thus far have been targeted at specific
multicomputer architectures and cannot be easily adapted to
simulate any parallel machine. What we have in mind is to
build a general purpose



this simulator provides a comprehensive platform for parallel 
processing which would encourage experimentation with 
parallel algorithms in various application areas. 

1.2 CONTRIBUTIONS AND OUTLINE OF THESIS 

In this thesis, a general purpose simulator tool for
multicomputers is built, which can be dynamically
reconfigured to simulate any user defined network of
multicomputers.

A multicomputer can be characterised by a number of
processing elements and a set of data routing functions
provided by the interconnection network [Siegel79]. The
processing elements are independent of the network topology,
and therefore some general purpose routines are designed and
implemented in the first stage, namely the simulator core.
Modules that provide a mechanism for interprocessor
communication, for both point to point and broadcast
communication, are also implemented in this core.

We have maintained a structure interface for each
topology, wherein a set of interconnection networks can be
predefined in the structure core. To facilitate this, a
number of auxiliary routines are provided in the simulator
core. Apart from the interconnection network, the structure
core also provides an interface to the interprocess
communication in accordance with the network defined.

With the primitives provided by the structure interface,
the user can easily write parallel programs and test parallel
algorithms.



1.3 THESIS ORGANISATION 


The thesis is organised into six chapters including the
present chapter. The concepts of process creation,
interprocess communication, and the structure of the
simulator and structure cores are introduced in chapter 2.
Chapter 3 discusses the implementation of the simulator.
Chapter 4 details a few examples of interfaces to the
simulator core. Chapter 5 presents the simulation runs and a
few examples of user programs. Conclusions and scope for
further work are outlined in chapter 6.

Appendix A contains the routines available for the structure
core. Appendix B illustrates a few examples of structure
files. Finally, Appendix C explains how to run a user's
program using the simulator.



CHAPTER 2 : DESIGN OF SIMULATOR 


2.1 INTRODUCTION 

The simulator is organised into three modules. The first
module, called the simulator core, implements the simulation
of a processing element and is dealt with in section 2.2.
The second module, the structure core, is an interface to
the previous module designed to encapsulate all
configurations of a particular class of multicomputers. The
structure core is discussed in section 2.3. The third module
involves the development of the user program with the
primitives provided by the structure core and is discussed
in section 2.4. Finally we conclude in section 2.5.

2.2 SIMULATOR CORE 

To simulate a single processing element of the network,
the simulator package has to create a process. Therefore, a
module for process creation is developed, which consists of a
number of routines. Each newly created process is an exact
replica of the creating process, since all the processors
have the same status in the multicomputer network.

The asynchronous model of communication is chosen, since it
is the one generally used in multicomputers. It is also a
more natural model for the programmer. A synchronous model
of communication can always be built over the asynchronous
one in the structure core. The key issues involved in
interprocessor communication of both point to point and
broadcast type are:



1. Communication links 


2. Design of communication routines. 

2.2.1 Communication links 

The interprocessor links for point to point communication
networks can be simulated using named pipes [Bach86], unnamed
pipes or sockets. The socket mechanism is not chosen in our
approach because it adheres to a server-client model, which is
inherently different from the type of communication in
multicomputers, wherein all processors have equal status. For
using unnamed pipes, the parent process must open the pipe
before creating another process so that the child process can
share it. As every process uses two pipes for bi-directional
communication, this approach exceeds the operating system
defined upper limit on the number of open files per process,
even for very small multicomputer configurations.

The advantage of using named pipes is that they are not
passed to the child process through parent-child inheritance.
Thus, if we have a pipe naming protocol which gives the name
of the pipe used for communication between two processors,
communication can be established and no process will have to
open more pipes than twice the number of ports per processor,
for bi-directional communication.
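
A pipe naming protocol of this kind can be sketched as follows. This is only an illustration: the function name pipe_name and the "/tmp/sim.<ppid>.<src>.<dst>" layout are assumptions of the sketch and are not taken from the simulator. It shows that a name derived from the parent's process id and the two node ids lets both ends of a link compute the same pipe name independently.

```c
#include <stdio.h>

/* Build the FIFO name for the one-way link src -> dst of the simulation
   whose parent process id is ppid. Returns 0 on success, -1 if the
   buffer is too small. The name layout is an assumption of this sketch. */
int pipe_name(char *buf, size_t size, long ppid, int src, int dst)
{
    int n = snprintf(buf, size, "/tmp/sim.%ld.%d.%d", ppid, src, dst);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A process would pass the resulting name to mkfifo() to create the pipe and later to open() to attach to it; a bi-directional link between two nodes simply uses the two names obtained by swapping src and dst.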

However, this approach cannot be used for broadcast
communication because a pipe provides a mechanism of
communication only between two processors. Hence we use
files with supervisory locks, one for each broadcast bus, to
solve the problem of inconsistency. It facilitates the



after doing the job by unlocking it to the other processors 
which are connected on the same bus. 


2.2.2 Design of Communication Routines

The simulator uses a blocked mode of communication for
receiving messages and an unblocked mode of communication for
sending messages. Thus the processor receiving a message
waits for the message to arrive if it has not arrived
already, but the sender of a message waits only when
the message buffer is full. This wait can be minimised by
a suitable choice of buffer size. The message length is
variable. The routine used for sending messages therefore
requires the message length as an argument, and the routine
used for receiving messages returns the message length.

2.2.3 Auxiliary Routines 

A number of routines are developed which facilitate 
writing the structure core for any topology of multicomputer 
configuration. These are discussed in chapter 3. 

2.3 STRUCTURE CORE 

The design of the structure core depends on how the simulator
works, so that it can adapt to any configuration of multicomputers. At
the start of simulation, the control is given to the 
structure core. The structure core at this point takes the 
architecture dependent parameters as the input and 
establishes the interconnection network. It then initiates 
the simulator core. The simulator core sets up the simulation 



between them taking input from the structure core. It then
passes control to the user program. After the
simulation, the simulation environment is disposed of by
terminating the processes and deleting the communication
links. To incorporate such a software protocol, we summarise
the structure of the structure core as given below:

* an entry point to construct the interconnection network
of multicomputers

* input interface

* a call to the entry point that sets up the simulation
environment

* a call to the entry point of the user's program

* a call to the protocol that deletes the communication links

* an interface to the interprocess communication, defined
for the network

* output interface

The structure core is discussed further in chapter 3.
Some examples of structure cores are discussed in chapter 4.


2.4 USER PROGRAM 

The user invokes the simulator from a parallel program
using the primitive for process creation given in
the structure core. To communicate messages
between the processes, the user program can use the protocol
provided by the communication interface in the structure
core. Examples of user programs can be found in
chapter 5, Simulation Runs.



2.5 CONCLUSION


In this chapter we described the different modules of the
general purpose simulator for multicomputers. It can be used
to simulate a variety of algorithms designed for these
architectures.

In the later chapters, we describe the
implementation of the simulator with a few examples of
structure cores and some example user programs.



CHAPTER 3 : IMPLEMENTATION OF SIMULATOR

3.1 INTRODUCTION

The implementation of the simulator can be broadly divided
into three stages, namely the simulator core, the structure
core and the user program. In the simulator core, the general
purpose routines for process creation and the mechanisms for
interprocess communication are implemented. This core contains
the parts of the simulator that are specific to the operating
system and common to all configurations of multicomputers. In
section 3.2 we discuss the simulator core implementation. In
the second stage, using these general purpose routines, the
primitives to simulate any given multicomputer configuration
are implemented. The structure core is therefore specific to a
processor topology. In general it is parameterised and can be
used to encapsulate all configurations of a particular class
of multicomputers. The structure core is discussed in section
3.3. The primitives provided by the structure core are used
to develop the user program. Finally we conclude this chapter
in section 3.4.

3.2 SIMULATOR CORE

The basic simulation routines are implemented here: process
creation, process termination, and the raw mode of
communication between a processor and its immediate
neighbors in the case of point to point communication, or
the processors connected on the same bus in the case of
broadcast communication.



3.2.1 Process Creation 


The creation of a process and the generation of the
associated links for interprocess communication are
implemented through a number of routines in the simulator
core. Let us start with the routine to create the process,
namely pfork().

PFORK subroutine
int pfork(i);
int i;

This subroutine creates a new process. The new
process is an exact copy of the creating process. The newly
created process creates and opens the named pipes that serve
as communication links to its immediate neighbors in the case
of point to point communication, or creates shared files for
each bus in the case of broadcast communication. If
successful, this routine returns 0; otherwise it returns
ERR_FORK after killing all the processes created until now.
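
The fork-and-clean-up behaviour described for pfork can be illustrated with a minimal sketch. It is not the simulator's pfork: the names run_elements and echo_id, the limit of 64 processes and the value of ERR_FORK are assumptions made for the example. Each child is an exact copy of the creating process, and a failed fork kills the children created so far.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define ERR_FORK (-1)   /* assumed value; the thesis does not give it */
#define MAXPROC  64     /* assumed limit for this sketch */

/* Example element body: a child simply reports its own index. */
int echo_id(int id)
{
    return id;
}

/* Fork n children; child k runs work(k) and exits with its result.
   If a fork fails, the children created so far are killed and
   ERR_FORK is returned. Otherwise the parent waits for all children
   and returns the sum of their exit codes. */
int run_elements(int n, int (*work)(int))
{
    pid_t pids[MAXPROC];
    int k, status, sum = 0, failed = 0;

    if (n < 0 || n > MAXPROC)
        return ERR_FORK;
    for (k = 0; k < n; k++) {
        pids[k] = fork();
        if (pids[k] == 0)
            _exit(work(k));            /* child: copy of the parent */
        if (pids[k] < 0) {             /* fork failed: undo and stop */
            while (--k >= 0) {
                kill(pids[k], SIGKILL);
                waitpid(pids[k], &status, 0);
            }
            return ERR_FORK;
        }
    }
    for (k = 0; k < n; k++) {
        if (waitpid(pids[k], &status, 0) < 0 || !WIFEXITED(status))
            failed = 1;
        else
            sum += WEXITSTATUS(status);
    }
    return failed ? ERR_FORK : sum;
}
```

For instance, run_elements(4, echo_id) starts four children whose exit codes are 0, 1, 2 and 3 and returns their sum.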

The procedure create_pipes(), called by each
newly created process, creates the named pipes that serve as
unidirectional communication links to its neighbors. To
implement a bi-directional communication link, two pipes are
used. The pid of the parent is used in the name of the pipe,
so that several simulations active at the same time
do not have duplicate pipe names. If successful, the routine
returns 0; otherwise it returns ERR_PIPECREATE.

The procedure open_pipes(), called by the process, opens
the named pipes to its neighbors for communication. This
procedure returns 0 in case of success.

The opening of pipes is not as simple as it seems at first
sight. This is because of the deadlock avoidance scheme in
Unix for pipes. The problem arises because a process that
opens a pipe just for reading is made to wait till another
process opens the same pipe for writing, and vice versa
[Bach86]. So, if each process follows the naive algorithm
given in fig 3.2 for opening the pipes, we can have a
classical deadlock situation in the following scenario.


Consider the hypercube of dimension 2. The processors are
numbered as shown in fig 3.1. The following sequence of
events is possible (even likely to occur):


i) Processor 0 tries to open the pipe from processor 1
for reading and gets blocked waiting for processor 1 to
open it for writing.

ii) Processor 1 tries to open the pipe from processor
0 for reading and gets blocked waiting for processor 0
to open it for writing.

iii) Processor 2 tries to open the pipe from
processor 0 for reading and gets blocked waiting for
processor 0 to open it for writing.

iv) Processor 3 tries to open the pipe from processor
1 for reading and gets blocked waiting for processor 1
to open it for writing.

Processors 0, 1, 2 and 3 are now in a classic deadlock
situation. The same fate awaits processors in higher
dimensions.



A simple solution to this problem would be to make each
process open its pipes for both reading and writing, even
though it might use a pipe for reading or writing only. In
this case no process has to wait for another. But in Unix
even this solution is infeasible because, whenever a process
closes the pipe, the data in the pipe is flushed if there are
no more readers left [Bach86]. Thus the following sequence of
events is possible:

i) Process 2 opens the pipe to process 0.
ii) Process 2 sends the message to process 0.
iii) Process 2 closes this pipe and exits.
iv) Process 0 opens this pipe and tries to read the
message from process 2. Since the pipe has no data,
process 0 waits forever for some message to arrive in
the pipe.

The solution adopted in this approach is to have the
processes open the pipes in different orders, in accordance
with a protocol which ensures that a deadlock situation never
arises.

According to this protocol:

1. Process 0 opens the pipe from process 1 and tries to
read. It will be successful since process 1 opens the
same pipe for writing first.

2. Process 1 opens the pipe from 3 for reading and 3 opens
the same pipe for writing first, and hence process 1
succeeds in reading the data from the pipe.

Similar is the case with the other processes. Fig 3.3
shows this protocol.
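
The difference between the naive and the ordered opening schemes can be checked with a small simulation. The sketch below is an illustration only and makes several assumptions: the names open_op, build_ops and opens_complete are hypothetical, the network is a hypercube, and a blocking open is modelled as completing only when the process at the other end of the same pipe is waiting in the matching open. No real pipes are created.

```c
#define MAXP   16            /* max simulated processes (assumption) */
#define MAXOPS (2 * MAXP)    /* two opens per neighbour */

struct open_op { int src, dst; };  /* pipe src -> dst; dst reads, src writes */

/* Hypercube adjacency: labels differ in exactly one bit. */
static int connected(int a, int b)
{
    int x = a ^ b;

    return x != 0 && (x & (x - 1)) == 0;
}

/* Build the open sequence of process i. ordered = 0 is the naive scheme
   (read end first for everyone); ordered = 1 opens the read end first
   only when i > j. Returns the number of opens. */
static int build_ops(int i, int n, int ordered, struct open_op *ops)
{
    int j, k = 0;

    for (j = 0; j < n; j++) {
        struct open_op rd, wr;

        if (!connected(i, j))
            continue;
        rd.src = j; rd.dst = i;    /* pipe from j, opened for reading */
        wr.src = i; wr.dst = j;    /* pipe to j, opened for writing */
        if (!ordered || i > j) { ops[k++] = rd; ops[k++] = wr; }
        else                   { ops[k++] = wr; ops[k++] = rd; }
    }
    return k;
}

/* Returns 1 if every process finishes its opens, 0 on deadlock. */
int opens_complete(int n, int ordered)
{
    struct open_op ops[MAXP][MAXOPS];
    int cnt[MAXP], pos[MAXP], p, progress;

    if (n < 1 || n > MAXP)
        return 0;
    for (p = 0; p < n; p++) {
        cnt[p] = build_ops(p, n, ordered, ops[p]);
        pos[p] = 0;
    }
    do {
        progress = 0;
        for (p = 0; p < n; p++) {
            struct open_op a;
            int q;

            if (pos[p] >= cnt[p])
                continue;
            a = ops[p][pos[p]];
            q = (a.src == p) ? a.dst : a.src;   /* other end of the pipe */
            if (pos[q] < cnt[q] &&
                ops[q][pos[q]].src == a.src &&
                ops[q][pos[q]].dst == a.dst) {
                pos[p]++;                       /* both opens complete */
                pos[q]++;
                progress = 1;
            }
        }
    } while (progress);
    for (p = 0; p < n; p++)
        if (pos[p] < cnt[p])
            return 0;
    return 1;
}
```

On the hypercube of dimension 2, opens_complete(4, 0) reproduces the deadlock described above, with every process stuck in its first read-side open, while opens_complete(4, 1) runs to completion.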

The complementary routine to process creation, terminate,
is called by a process when its execution is complete.

TERMINATE subroutine
void terminate();

This routine, when called by a process, causes the normal
termination of the process.


After the simulation, the simulation environment is
disposed of by terminating the processes and deleting the
communication links. To delete all the files created during
simulation, the subroutine clean is called by process 0
before termination.

CLEAN subroutine
void clean();

This routine, when called by the process, removes all the
named pipes and shared files (if created).

3.2.2 Interprocessor Communication

Having described the creation and termination of
processes, we now describe the communication routines which
are called by the processing elements in a multicomputer
configuration for communicating among themselves. The
communication routines are implemented by the complementary
pair of subroutines read_from and write_to.



WRITE_TO subroutine

int write_to(which, mesg, len);
int which, len;
char *mesg;

The routine write_to writes 'len' number of bytes into
the pipe to process 'which', from the memory location 'mesg'.
Since the message length is variable, four bytes containing
the message length are prepended to the message when it is
written into the pipe. If the write operation is successful,
the routine returns 0; otherwise it returns -1 as an error
condition.

READ_FROM subroutine

int read_from(which, mesg, len);
int which;
char *mesg;
int *len;

This routine first reads the message length into the
memory location 'len' and then reads that many bytes into
the memory location 'mesg', from the pipe specified by
'which'. As this is a blocked read, execution is suspended
till the number of bytes indicated by the memory location
'len' has been read from the pipe. If all the requested
bytes are read, read_from returns 0 indicating the success
of the read operation; otherwise it returns -1.
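
The length-prefixed framing used by write_to and read_from can be sketched over plain file descriptors. This is an illustration under stated assumptions, not the simulator's code: the names frame_write and frame_read are hypothetical, they take a file descriptor instead of a process number, and frame_read takes an extra capacity argument 'cap' so that the sketch cannot overrun the caller's buffer.

```c
#include <stdint.h>
#include <unistd.h>

/* Write exactly n bytes, retrying on short writes. Returns 0 or -1. */
static int write_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t k = write(fd, p, n);

        if (k <= 0)
            return -1;
        p += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Read exactly n bytes, blocking until they arrive. Returns 0 or -1. */
static int read_all(int fd, char *p, size_t n)
{
    while (n > 0) {
        ssize_t k = read(fd, p, n);

        if (k <= 0)                 /* error or end of file */
            return -1;
        p += k;
        n -= (size_t)k;
    }
    return 0;
}

/* Send a four-byte length header followed by len message bytes. */
int frame_write(int fd, const char *mesg, int len)
{
    uint32_t hdr;

    if (len < 0)
        return -1;
    hdr = (uint32_t)len;
    if (write_all(fd, (const char *)&hdr, sizeof hdr) != 0)
        return -1;
    return write_all(fd, mesg, (size_t)len);
}

/* Receive one message: the header first, then that many bytes.
   The length is returned through *len. */
int frame_read(int fd, char *mesg, int cap, int *len)
{
    uint32_t hdr;

    if (cap < 0 || read_all(fd, (char *)&hdr, sizeof hdr) != 0)
        return -1;
    if (hdr > (uint32_t)cap)        /* message would not fit */
        return -1;
    if (read_all(fd, mesg, hdr) != 0)
        return -1;
    *len = (int)hdr;
    return 0;
}
```

The header is written in the host's byte order, which is adequate when both ends of the pipe run on the same machine, as they do in a simulation of this kind.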


Now let us discuss the routines which are called by the
processes for interprocessor communication in the case of
broadcast communication.



BREAD(channel, mesg, len)

int channel;
char *mesg;
int *len;

This routine reads the number of bytes indicated by the
memory location 'len' from the file 'channel' into the
memory location 'mesg'. To overcome the problem of
inconsistency, the process opens the shared file 'channel',
locks it, reads the message and then unlocks it,
releasing the file for the other processors connected on the
same bus. If the read operation is successful the routine
returns 0; otherwise it returns -1.

BWRITE(channel, mesg, len)

int channel;
char *mesg;
int len;

This routine writes 'len' number of bytes
to the file 'channel' from the memory location 'mesg'. The
process opens the shared file 'channel', locks it,
writes the message and then unlocks it, releasing the
file for the other processes connected on the same bus. If
the write operation is successful the routine returns 0;
otherwise it returns -1.
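
The open, lock, transfer, unlock sequence used by bread and bwrite can be sketched with POSIX advisory record locks. The sketch is an assumption-laden illustration rather than the simulator's implementation: the names bus_write and bus_read are hypothetical, the bus is identified by a file path instead of an integer channel, each write replaces the previous message, and bus_read takes a buffer capacity 'cap'.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Lock or unlock the whole file; blocks until the lock is granted. */
static int set_lock(int fd, short type)
{
    struct flock fl;

    memset(&fl, 0, sizeof fl);
    fl.l_type = type;          /* F_RDLCK, F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;              /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLKW, &fl);
}

/* Replace the contents of the bus file with len bytes of mesg. */
int bus_write(const char *path, const char *mesg, int len)
{
    int fd, ok;

    if (len < 0 || (fd = open(path, O_RDWR | O_CREAT, 0600)) < 0)
        return -1;
    if (set_lock(fd, F_WRLCK) != 0) {
        close(fd);
        return -1;
    }
    ok = ftruncate(fd, 0) == 0 &&
         write(fd, mesg, (size_t)len) == (ssize_t)len;
    set_lock(fd, F_UNLCK);     /* release the bus for other processes */
    close(fd);
    return ok ? 0 : -1;
}

/* Read up to cap bytes from the bus file; byte count goes to *len. */
int bus_read(const char *path, char *mesg, int cap, int *len)
{
    int fd;
    ssize_t k;

    if (cap < 0 || (fd = open(path, O_RDONLY)) < 0)
        return -1;
    if (set_lock(fd, F_RDLCK) != 0) {
        close(fd);
        return -1;
    }
    k = read(fd, mesg, (size_t)cap);
    set_lock(fd, F_UNLCK);
    close(fd);
    if (k < 0)
        return -1;
    *len = (int)k;
    return 0;
}
```

The locks only prevent readers from seeing a half-written message. How a reader learns that a new message has been placed on the bus is a separate question that this sketch does not address.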

3.2.3 Support Routines

In this section, we describe a number of support
routines. These routines simplify the task of writing a
structure core for any topology of multicomputer network. We
start with the routine connected(), which is used to verify
the connectivity of two given processing elements.



CONNECTED subroutine 

#define TRUE 1
#define FALSE 0
int connected(a, b);
int a, b;

The subroutine returns TRUE if the processing elements
whose node ids are 'a' and 'b' are connected by a link;
otherwise FALSE is returned.

CONNECT subroutine

connect(a, b);
int a, b;

The subroutine establishes a unidirectional link
between the processing elements whose node ids are 'a' and 'b'.

BROAD_LINK subroutine

broad_link(a, b);
int a, b;

This subroutine establishes a broadcasting link between
the processing elements whose node ids are 'a' and 'b'.

COMPLIMENT subroutine

int compliment(nodeaddr, i, d);
int nodeaddr, i, d;

This routine inverts the ith digit of the node identifier
nodeaddr, having 'd' number of digits, and returns the
inverted node address.

GET_DIGIT subroutine

int get_digit(nodeaddr, r, d, i);
int nodeaddr, r, d, i;

Subroutine get_digit returns the ith digit from the radix
'r' representation of the node identifier nodeaddr, having
'd' digits.

REPLACE subroutine

int replace(nodeaddr, r, i, j);
int nodeaddr, r, i, j;

Subroutine replace substitutes the ith digit of the node
identifier 'nodeaddr' by the digit 'j' and returns the
substituted node identifier.
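
The digit manipulations behind these three routines amount to simple radix arithmetic. The sketch below uses its own names (digit_of, with_digit, flip_bit) and argument lists so as not to suggest it is the simulator's code, and it assumes that digit 0 is the least significant digit, which the text does not state.

```c
/* Digit i of nodeaddr written in radix r (digit 0 = least significant). */
int digit_of(int nodeaddr, int r, int i)
{
    while (i-- > 0)
        nodeaddr /= r;
    return nodeaddr % r;
}

/* nodeaddr with digit i (radix r) replaced by j, where 0 <= j < r. */
int with_digit(int nodeaddr, int r, int i, int j)
{
    int weight = 1, old;

    while (i-- > 0)
        weight *= r;               /* weight = r^i */
    old = (nodeaddr / weight) % r;
    return nodeaddr + (j - old) * weight;
}

/* nodeaddr with binary digit i inverted (the radix 2 case). */
int flip_bit(int nodeaddr, int i)
{
    return nodeaddr ^ (1 << i);
}
```

As an example, node 7 in radix 3 is written "21"; replacing digit 0 by 0 gives "20", which is node 6, and replacing digit 1 by 0 gives "01", which is node 1.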

3.3 STRUCTURE CORE

Having described the basic routines that simulate a single
processing element of multicomputers, we now describe the
routines in the second layer of the simulator. The second
layer is built on top of the basic routines of the simulator
core and provides a better user interface.

The structure core contains the following modules: 

a. Input interface 

b. Topology setup 

c. Process Creation 

d. Interprocessor Communication 

e. Support routines 

3.3.1 Input interface

The macro input() prompts for a parameter and reads it.
The total number of processors in the network is then
computed.



3.3.2 Topology Setup

The simulator core provides a number of support
routines to describe the network topology. A number of
examples of different multicomputer configurations are
discussed in the next chapter. The network topology is passed
to the simulator core to create the processes with the
appropriate communication links.

3.3.3 Process Creation 

Different types of process creation calls are
developed depending upon the topology of the multicomputers.
All these routines have been implemented using the basic
routine of process creation, pfork(). They are listed below:

subroutine SFORK

int sfork(dest);
int dest;

The sfork call is used in networks like a ring or a
linear array of processing elements to create the processes
sequentially. sfork returns 0 for normal operation and
returns -1 for an error condition.

subroutine LFORK

int lfork(parameter);
int parameter;

The lfork call is used in multicomputer networks to
create the processes in accordance with the parameter. Here
the parameter can be dimension (hypercube, MMS), level (tree),
direction (mesh) etc. The call creates all the processes in
the specified parameter. The routine returns 0 in normal
execution and -1 for an error condition.



subroutine GFORK

int gfork();

The gfork call is used to create all the processes of
the multicomputer network. This routine is used to create
the processes in networks with broadcast communication. On
successful execution, this routine returns 0; otherwise it
returns -1 as an error condition.

3.3.4 Interprocessor Communication

In this class of routines, data communication among
processing elements is handled by an enhanced set of
interprocess communication routines. These routines are
developed using the basic routines of communication: write_to
and read_from in the case of point to point communication,
bread and bwrite in the case of broadcast communication.

Point to point communication: 

subroutine LSEND_TO

int lsend_to(src, param, dataptr, len);
int src, param;
char *dataptr;
int len;

This routine causes the processing element 'src' to send
'len' number of bytes to the processing element connected to
it in parameter 'param'. This parameter can be level (tree),
dimension (hypercube, MMS), direction (mesh) etc. The
complementary task of receiving the data is handled by
lrecv_from, wherein 'len' number of bytes are read by the
processing element 'dest'.

subroutine LRECV_FROM

int lrecv_from(dest, param, dataptr, len);
int dest, param;
char *dataptr;
int len;

These routines are implemented in accordance with
the parameter in the interfaces to different topologies in
the next chapter, EXAMPLES.

Broadcast communication:

subroutine GSEND_TO

int gsend_to(src, dim, dataptr, len);
int src, dim;
char *dataptr;
int len;

subroutine GRECV_FROM

int grecv_from(dest, dim, dataptr, len);
int dest, dim;
char *dataptr;
int len;

The gsend_to routine is used by the processing element
'src' to send data to all processing elements connected to
the broadcast bus in dimension 'dim', whereas the
complementary routine grecv_from causes the processing
element 'dest' to read the data from the broadcast bus
connected to it in dimension 'dim'. These routines can be
implemented by using the basic routines of broadcast
communication, bread and bwrite. The implementation is
illustrated in the next chapter, EXAMPLES, under the
structure interface to the broadcast hypercube.



3.4 CONCLUSION


In this chapter we presented the implementation of a
simulator suitable for all multicomputer configurations. The
implementation is divided into two layers. The first layer is
an implementation of process creation and interprocessor
communication to simulate a single processing element.

The second layer, which is built over the first,
provides a better user interface. The subroutines pfork,
terminate, clean, connected, read_from and write_to implement
general purpose routines, and the other subroutines described
in this chapter are specific to a particular class of
multicomputers. The simulator has been implemented on a Sun
5/60 workstation running Sun operating system version
4.0.5c.








procedure open_pipes_1 /* Executed by each process */
begin
    for all processors from 0 to N-1 do
    begin
        if connected then
        begin
            open pipe from this processor for reading
            open pipe to this processor for writing
        end
    end
end


Figure 3.2 : A naive algorithm for opening pipes


procedure open_pipes_by_proc i /* Executed by each process */
begin
    for all processors from j = 0 to N-1 do
    begin
        if connected then
            if ( i > j )
            begin
                open pipe from this processor for reading
                open pipe to this processor for writing
            end
            else
            begin
                open pipe to this processor for writing
                open pipe from this processor for reading
            end
    end
end


Figure 3.3 : Deadlock avoiding algorithm for opening pipes



CHAPTER 4 : EXAMPLES


4.1 INTRODUCTION 

The major objective of the thesis is to provide
facilities for the simulation of multicomputers. A
multicomputer is characterized by a number of processing
elements and a set of data routing functions provided by the
interconnection network. The processing elements are
independent of the network topology and are therefore
implemented by the simulator core. The interconnection
network is specific to a class of multicomputers and is
described by the structure core.


The interconnection functions are different for
different machines. So, in order to provide a general purpose
simulator tool, we maintain a structure interface for each
topology. In this chapter we discuss different examples of
such interfaces. We discuss the hypercube structure core in
section 4.2, the MMS (Multidimensional Multilink System)
structure core in section 4.3, the mesh [Hwang85] structure
core in section 4.4, the binary tree structure core in
section 4.5 and lastly the broadcast hypercube structure core
in section 4.6. Finally we conclude in section 4.7.

The structure core consists of the following
modules:

i) Input interface
ii) Topology setup
iii) Process creation
iv) Interprocess communication
v) Support routines

Let us discuss these modules with respect to different
topologies.

4.2 STRUCTURE INTERFACE TO HYPERCUBE [hp_cube.c]

The network consists of N = 2^d nodes forming a d
dimensional hypercube. The nodes are labeled 0, 1, ..., 2^d - 1.
Two nodes are adjacent if their labels differ in exactly one
bit position. Fig 4.1 shows a hypercube model.

(i) Input interface: The macro input(d) reads the dimension
'd' as an input. The total number of processors present in
the network is then N = 2^d.

(ii) Topology set_up: A number of support routines are
provided in the simulator core to facilitate the set up of the
topology of different multicomputer configurations. The
hypercube network is established using these routines as
follows:


for peid = 0 to N-1
    for j = 0 to d-1
        connect(peid, compliment(peid, j));

where peid is the node address of the processing element.

(iii) Process creation: As described in the previous chapter,
in the section on the structure core, the routine lfork is
used to create the processes dimension wise.

LFORK subroutine
lfork(dim);
int dim;

The lfork call creates all the processes in dimension
'dim'. The subroutine is implemented using the basic routine
of process creation, pfork().


(iv) Interprocess communication: Using the basic routines of
communication, read_from and write_to, implemented in the
simulator core, the subroutines lsend_to and lrecv_from are
developed to provide communication between any process and
its neighbor in the dimension specified by 'dim'. These
routines return 0 on successful execution and -1 as an error
condition.

LSEND_TO subroutine

int lsend_to(src, dim, dataptr, len);
int src, dim;
char *dataptr;
int len;

LRECV_FROM subroutine

int lrecv_from(dest, dim, dataptr, len);
int dest, dim;
char *dataptr;
int len;

The subroutine lsend_to allows 'len' number of bytes to
be sent by the processing element whose address is specified
by 'src' to the processor connected to 'src' in dimension
'dim'. The complementary task of receiving data is
implemented by the routine lrecv_from(). These routines are
implemented using the basic routines of interprocessor
communication, as follows:

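int lsend_to(src, dim, dataptr, len)
int src, dim;
char *dataptr;
int len;
{
    int neigb;

    /* neighbour of src across dimension dim */
    neigb = compliment(src, dim);
    /* pass on the status of the basic write routine */
    return write_to(neigb, dataptr, len);
}

int lrecv_from(dest, dim, dataptr, len)
int dest, dim;
char *dataptr;
int len;
{
    int neigb;

    /* neighbour of dest across dimension dim */
    neigb = compliment(dest, dim);
    /* read_from returns the message length through its last argument */
    return read_from(neigb, dataptr, &len);
}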

(v) Support routines: The subroutines in this section
return topological parameters to
simplify the task of user programming. We start with the
get_nodeid subroutine, used to get the node address of the
processing element.

GET_NODEID subroutine
    int get_nodeid();

This routine returns the node address of the process
being executed.

GET_DIM subroutine
    int get_dim();

This subroutine returns the dimension of the hypercube
network.

GET_NOPROCS subroutine
    int get_noprocs();

The routine get_noprocs returns the number of processors in
the hypercube network.

4.3 STRUCTURE INTERFACE TO MMS [mms.c]

The model consists of $N = p^d$ nodes forming an MMS
network of dimension 'd' and drop 'p'. Two processors are
connected if and only if their addresses, written in radix p,
differ in exactly one digit. Fig 4.2 illustrates this model. The
processors are numbered from 0 to N-1.

(i) Input interface: The macro input(d) reads the dimension
and the macro input(p) reads the drop of the network. The
total number of processors in the network is then $N = p^d$.

(ii) Topology setup: The topology of the MMS network is set
up by using the support routines provided in the simulator as
follows:

    for peid = 0 to N-1
        for j = 0 to d-1
            for i = 0 to p-1
                if (get_digit(peid, j, p) != i)
                    connect(peid, replace(peid, p, j, i))

where peid is the node address of the processing
element. This topology of the network is passed to the
simulator core to create the processes with the appropriate
communication links.
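The digit manipulation that this loop relies on can be sketched as follows. The helper names digit_of, replace_digit and mms_degree are hypothetical, and their argument order is chosen for this illustration only; the simulator's own get_digit and replace are listed in Appendix A with a different signature.

```c
/* i-th digit (0 = least significant) of nodeaddr written in radix p. */
int digit_of(int nodeaddr, int p, int i)
{
    while (i-- > 0)
        nodeaddr /= p;
    return nodeaddr % p;
}

/* Copy of nodeaddr whose i-th radix-p digit is set to v. */
int replace_digit(int nodeaddr, int p, int i, int v)
{
    int weight = 1, k;

    for (k = 0; k < i; k++)
        weight *= p;
    return nodeaddr + (v - digit_of(nodeaddr, p, i)) * weight;
}

/* Number of neighbours found by the topology loop for one node. */
int mms_degree(int peid, int p, int d)
{
    int i, j, count = 0;

    for (j = 0; j < d; j++)
        for (i = 0; i < p; i++)
            if (digit_of(peid, p, j) != i)
                count++;
    return count;
}
```

Under these assumptions every node gets d*(p-1) neighbours, and with p = 2 replacing a digit by the other value is the same as flipping a bit, which is the hypercube case.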

(iii) Process creation: The routine lfork is implemented to
create the processes dimension-wise. It is the same as in the
case of the hypercube network, since the hypercube is the special
case of the MMS structure in which the drop of the network is always
two.
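The numbering that dimension-wise creation produces can be checked with a small sketch. It follows the arithmetic of lfork in mms.c (Appendix B), where dimension dim1 creates power(drop, dim1) * (drop - 1) processes starting at node power(drop, dim1); the function names here are hypothetical.

```c
/* Integer power, standing in for the simulator's power() routine. */
int ipow(int base, int exp)
{
    int r = 1;

    while (exp-- > 0)
        r *= base;
    return r;
}

/* First node address created by lfork for dimension dim1. */
int lfork_first(int drop, int dim1)
{
    return ipow(drop, dim1);
}

/* Number of processes created by lfork for dimension dim1. */
int lfork_count(int drop, int dim1)
{
    return ipow(drop, dim1) * (drop - 1);
}

/* Processes alive after calling lfork for dimensions 0 .. d-1
   (process 0 exists before the first call). */
int total_after(int drop, int d)
{
    int dim1, total = 1;

    for (dim1 = 0; dim1 < d; dim1++)
        total += lfork_count(drop, dim1);
    return total;
}
```

With drop = 2 this gives node 1 for dimension 0, nodes 2 and 3 for dimension 1, and nodes 4 to 7 for dimension 2, so the total after d calls is drop^d in both the hypercube and the general MMS case.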

(iv) Interprocess communication: The subroutines lsend_to and
lrecv_from are developed to provide communication between a
process and its neighbors in the dimension specified by the
argument 'dim' and drop 'dr'. These routines return 0
on success and -1 on error.



LSEND_TO subroutine

    int lsend_to(src, dim, dr, dataptr, len);
    int src, dim, dr;
    char *dataptr;
    int len;

LRECV_FROM subroutine

    int lrecv_from(dest, dim, dr, dataptr, len);
    int dest, dim, dr;
    char *dataptr;
    int len;

These routines are implemented using the basic
interprocess communication subroutines write_to and
read_from as follows.

    lsend_to(src, dim, dr, dataptr, len)
    int src, dim, dr;
    char *dataptr;
    int len;
    {
        int neighb;

        /* neighbor of src whose digit 'dim' equals 'dr' */
        neighb = replace(src, p, dim, dr);
        write_to(neighb, dataptr, len);
    }

    lrecv_from(dest, dim, dr, dataptr, len)
    int dest, dim, dr;
    char *dataptr;
    int len;
    {
        int neighb;

        /* neighbor of dest whose digit 'dim' equals 'dr' */
        neighb = replace(dest, p, dim, dr);
        read_from(neighb, dataptr, len);
    }


(v) Support routines: In addition to the routines explained for
the hypercube structure, the subroutine get_drop is implemented
here to get the drop of the MMS network in the user program.

GET_DROP subroutine
    int get_drop();



4.4 STRUCTURE INTERFACE TO MESH [mesh.c]

In a mesh network, the nodes are arranged in a
two-dimensional lattice. Communication is allowed only
between neighboring nodes; hence interior nodes communicate
with four other processors. Fig 4.3 illustrates a 2_d mesh
network with no wraparound connections. Let 'n' be the size
of the mesh and let N be the total number of processing elements
in the network. The processors are numbered from 0 to N-1.

(i) Input interface: The macro input(size) reads the size of
the mesh to be described. The total number of processors in
the network is then N = n*n.

(ii) Topology setup: The 2_d mesh network can be established
using the support routines of the simulator core as
follows:

    for peid = 0 to N-1
        connect(peid, get_neighb(peid, LEFT));
        connect(peid, get_neighb(peid, RIGHT));
        connect(peid, get_neighb(peid, ABOVE));
        connect(peid, get_neighb(peid, BELOW));

The routine get_neighb is described later under
support routines. In mesh.c (Appendix B) a link is made only
when get_neighb returns a valid address, that is, not -1. Once
the structure of the network is
established, it is passed to the simulator core to create the
processes and the appropriate communication links.

(iii) Process creation: The subroutine lfork() is implemented
to create the processes row-wise.

LFORK subroutine

    int lfork(row);
    int row;

This routine has been implemented using the basic routine
of process creation, pfork().


(iv) Interprocessor communication: The complementary
subroutines lsend_to and lrecv_from have been developed for
communication between each processor of the
network and its neighbor in one of the directions
LEFT, RIGHT, BELOW and ABOVE. On successful completion of the
data transfer the routines return 0; otherwise they return -1
as an error condition.

LSEND_TO subroutine

    int lsend_to(src, dir, dataptr, len);
    int src, dir;
    char *dataptr;
    int len;

LRECV_FROM subroutine

    int lrecv_from(dest, dir, dataptr, len);
    int dest, dir;
    char *dataptr;
    int len;

These subroutines cause the processing element to send
data to, or receive data from, the processor connected
to it in the direction 'dir'. An implementation of these
routines using the basic routines of interprocessor
communication is given below.

    int lsend_to(src, dir, dataptr, len)
    int src, dir;
    char *dataptr;
    int len;
    {
        int neighb;

        /* neighbor of src in direction dir */
        neighb = get_neighb(src, dir);
        write_to(neighb, dataptr, len);
    }

    int lrecv_from(dest, dir, dataptr, len)
    int dest, dir;
    char *dataptr;
    int len;
    {
        int neighb;

        /* neighbor of dest in direction dir */
        neighb = get_neighb(dest, dir);
        read_from(neighb, dataptr, len);
    }

The routine get_neighb() is described later in this section
under support routines.

(v) Support routines: The routines that return the parameters of
the 2_d mesh network are implemented here. The routine
get_size() returns the size (number of rows or columns) of the
2_d mesh network. The routines get_noprocs() and get_nodeid()
are the same as in the hypercube structure core.

The subroutine get_neighb is developed to get the neighbor of
a processor in the direction 'dir'.

GET_NEIGHB subroutine

    int get_neighb(nodeaddr, dir)
    int nodeaddr, dir;

This routine checks whether the processor with node
address nodeaddr has a neighbor in the direction 'dir'
(LEFT, RIGHT, ABOVE or BELOW). If so, it returns the node
address of that neighbor; otherwise it returns -1.


4.5 STRUCTURE INTERFACE TO BINARY TREE [tree.c]

The network, consisting of $N = 2^l - 1$ nodes, forms a binary
tree of height l. Communication is allowed only between
a node and its parent and children. Fig 4.4 shows a binary tree. The
processors are numbered from 0 to N-1.

(i) Input interface: The macro input(l) reads the number of
levels (height) of the binary tree network. The total number
of processors in the network is then $N = 2^l - 1$.

(ii) Topology setup: The topology of the network is
established by using the support routines of the simulator core
as follows:

    for peid = 0 to N-1
        connect(peid, get_parent(peid))
        connect(peid, get_left_child(peid))
        connect(peid, get_right_child(peid))

The routines get_parent, get_left_child and
get_right_child are described later under support routines.
Once the network is established, it is passed to the
simulator to create the processes with the corresponding
communication links.

(iii) Process creation: The routine lfork creates the
processes level by level. This routine has been implemented
using the basic routine of process creation, pfork. It returns 0
on successful creation of the processes; otherwise it returns -1
as an error condition.

LFORK subroutine

    int lfork(level);
    int level;


(iv) Interprocess communication: A number of interprocess
communication routines are provided for communication between
a processor and its parent and its
children in the binary tree network.

Subroutine SEND_TO_PARENT

    int send_to_parent(src, dataptr, len);
    int src; char *dataptr; int len;

Subroutine RECV_FROM_PARENT

    int recv_from_parent(dest, dataptr, len);
    int dest; char *dataptr; int len;

The subroutine send_to_parent causes the data to be
sent from the processor specified by the node address 'src'
to its parent. The complementary routine recv_from_parent
makes the processor 'dest' receive data from its
parent. These routines return 0 on successful execution;
otherwise they return -1 as an error condition. They are
implemented using the basic communication routines
read_from and write_to as follows:


    int send_to_parent(src, dataptr, len)
    int src;
    char *dataptr;
    int len;
    {
        int dest;

        /* the parent of src is the receiver */
        dest = get_parent(src);
        write_to(dest, dataptr, len);
    }

    int recv_from_parent(dest, dataptr, len)
    int dest;
    char *dataptr;
    int len;
    {
        int src;

        /* the parent of dest is the sender */
        src = get_parent(dest);
        read_from(src, dataptr, len);
    }



Similarly, the routines send_to_left_child, recv_from_left_child,
send_to_right_child and recv_from_right_child
are implemented to provide communication between a
processor and its children. Please refer to tree.c in APPENDIX B
for details.

(v) Support routines: The subroutine get_height(), called by
the user program, is implemented to get the number of levels
in the binary tree network.

Subroutine GET_HEIGHT
    int get_height();

The subroutines get_noprocs() and get_nodeid() are the same as
in the other structure cores. In addition to these, the
subroutines get_parent, get_left_child and get_right_child are
implemented to ease the writing of the topology setup and to
develop the routines of interprocess communication.

Subroutine GET_PARENT

    int get_parent(nodeaddr)
    int nodeaddr;

get_parent checks whether the processor specified
by the node address 'nodeaddr' is the root of the tree. If so,
it returns -1; otherwise it returns the node address of its
parent processor.

Subroutine GET_LEFT_CHILD

    int get_left_child(nodeaddr);
    int nodeaddr;

get_left_child checks whether the processor
specified by the node address 'nodeaddr' is a leaf node. If so,
it returns -1; otherwise it returns the node address of its left
child processor.

Subroutine GET_RIGHT_CHILD

    int get_right_child(nodeaddr);
    int nodeaddr;

get_right_child checks whether the processor
specified by the node address 'nodeaddr' is a leaf node. If so, it
returns -1; otherwise it returns the node address of its right
child processor.
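The three address routines can be sketched as below. This assumes the usual level-order numbering with the root at 0, which agrees with the get_parent listing in tree.c (Appendix B); the names tree_parent, tree_left_child and tree_right_child, and the explicit node count n, are choices made for this illustration only.

```c
/* Parent of node in a level-order numbered tree; -1 for the root. */
int tree_parent(int node)
{
    return node == 0 ? -1 : (node - 1) / 2;
}

/* Left child of node in a tree of n nodes; -1 if node is a leaf. */
int tree_left_child(int node, int n)
{
    int child = 2 * node + 1;

    return child < n ? child : -1;
}

/* Right child of node in a tree of n nodes; -1 if node is a leaf. */
int tree_right_child(int node, int n)
{
    int child = 2 * node + 2;

    return child < n ? child : -1;
}
```

For a tree of height 3 (n = 7), node 2 has children 5 and 6, and nodes 3 to 6 are leaves.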

We now turn to the structure core for broadcast
communication networks. We discuss only one example,
the structure interface to the broadcast hypercube.

4.6 STRUCTURE INTERFACE TO BROADCAST HYPERCUBE [br_hpcube.c]

Fig 4.5 shows a broadcast hypercube model. Let 'd' be
the dimension and 'p' the drop of the hypercube
network. Let 'N' be the total number of processors in the
network. The processors are numbered from 0 to N-1.

(i) Input interface: It is the same as in the interface to
the MMS structure with point to point communication.

(ii) Topology setup: Using the support routines of the simulator
core, the broadcast hypercube network is established as
follows:

    for peid = 0 to N-1
        for j = 0 to d-1
            for i = 0 to p-1
                if (get_digit(peid, j, p) != i)
                    broad_link(peid, replace(peid, p, j, i), get_channel(peid, j))

The routine get_channel is described later under
support routines. The topology of the network is passed to
the simulator to create the processes with broadcast links.

(iii) Process creation: The gfork routine creates the
processes dimension-wise, creating files, one for each
broadcast channel. This routine has been implemented using
the basic routine of process creation, pfork. The routine
returns 0 on successful creation of processes; otherwise it
returns -1 as an error condition.

Subroutine GFORK

    int gfork(dim);
    int dim;

(iv) Interprocess communication: The routine gsend_to causes
the processor 'src' to send data on the broadcast bus
connected to it in dimension 'dim'. The complementary routine
grecv_from makes the processor 'dest' receive data from the
broadcast bus connected to it in dimension 'dim'. If the
data transfer is successful the routines return
0; otherwise they return -1 as an error condition.

Subroutine GSEND_TO

    int gsend_to(src, dim, dataptr, len);
    int dim, src; char *dataptr; int len;

Subroutine GRECV_FROM

    int grecv_from(dest, dim, dataptr, len);
    int dim, dest; char *dataptr; int len;

These routines are implemented using the basic routines of
broadcast communication, bread and bwrite, as follows:

    int gsend_to(src, dim, dataptr, len)
    int src, dim;
    char *dataptr;
    int len;
    {
        int ch;

        /* broadcast channel attached to src in dimension dim */
        ch = get_channel(src, dim);
        bwrite(src, ch, dataptr, len);
    }

    int grecv_from(dest, dim, dataptr, len)
    int dest, dim;
    char *dataptr;
    int len;
    {
        int ch;

        /* broadcast channel attached to dest in dimension dim */
        ch = get_channel(dest, dim);
        bread(dest, ch, dataptr, len);
    }


(v) Support routines: The routines get_drop, get_dim and
get_noprocs are the same as in the interfaces
with point to point communication.

In addition to these, a support routine get_channel has been
implemented.

Subroutine GET_CHANNEL

    int get_channel(nodeaddr, dim);
    int nodeaddr, dim;

This routine returns the channel number to which the
processor with node address 'nodeaddr' is connected in
dimension 'dim'.
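The text does not say how channel numbers are assigned, so the following is only one possible scheme, offered as a sketch. It assumes that a bus in dimension j joins the p nodes whose radix-p addresses agree in every digit except digit j, and it numbers the buses of dimension j from j * p^(d-1) upward. The names chan_pow and channel_of are hypothetical and are not the simulator's get_channel.

```c
/* Integer power helper for this sketch. */
int chan_pow(int base, int exp)
{
    int r = 1;

    while (exp-- > 0)
        r *= base;
    return r;
}

/* Hypothetical channel numbering: delete digit j from the radix-p
   address and offset the result by the dimension number. */
int channel_of(int nodeaddr, int p, int d, int j)
{
    int low_weight = chan_pow(p, j);
    int low = nodeaddr % low_weight;          /* digits below j */
    int high = nodeaddr / (low_weight * p);   /* digits above j */
    int rest = high * low_weight + low;       /* address without digit j */

    return j * chan_pow(p, d - 1) + rest;
}
```

Under this assumption two nodes share a channel in dimension j exactly when they differ only in digit j, and the network has d * p^(d-1) channels in total.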



4.7 CONCLUSION 


This chapter has discussed four multicomputer
configurations with point to point communication, namely
hypercube, MMS, 2_d mesh and binary tree, and one multicomputer
network with broadcast communication, the broadcast hypercube
model. To provide a better user interface, each structure file
consists of an input interface to read the parameters of the
network, and interfaces for process creation and interprocess
communication. A number of support routines are available to
get the parameters of the network in the user program.
CHAPTER 5 : SIMULATION RUNS


5.1 INTRODUCTION 


In this chapter, we discuss sample programs for a few
parallel algorithms that are implemented using the simulator
package. In section 5.2, we discuss the implementation of
summation of numbers on the hypercube architecture, summation of
numbers on the mesh structure in section 5.3 and matrix_by_vector
multiplication on the tree structure in section 5.4. Finally we
conclude in section 5.5.


5.2 SUMMATION OF NUMBERS ON HYPERCUBE [Quinn87]

The algorithm to add $n = 2^m$ values on a hypercube model
has been adapted from [Quinn87].

    procedure SUMMATION(n)
    /* computes a_0 + a_1 + a_2 + ... + a_{n-1};
       the result is stored in a_0 */
    begin
        for i = log n - 1 downto 0 do       /* i: dimension number */
            d = 2^i
            for j = 0 to d-1 do in parallel
                t_j <= a_{j+d}
                a_j <- a_j + t_j
            endfor
        endfor
    end

In the algorithm above, t_j <= a_{j+d} denotes communication of the
data item from an adjacent processor's local memory into the active
processor.

Since every loop iteration requires constant time, the
complexity of this algorithm is O(log n). The algorithm is
illustrated in Fig 5.2.1 for n = 16.

This algorithm can be implemented using the primitives
available in the corresponding structure file [hp_cube.c]. The
routine lfork(dim) is called to invoke the simulator to
create the processes dimension-wise, and the routines lsend_to
and lrecv_from are called by a process to communicate with its
neighbour in the given dimension. The user's program implementing the
above algorithm is illustrated in Fig 5.2.2.
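The data movement of the summation algorithm can be traced without the simulator using a sequential sketch. The array a[] stands for the values held one per node, and the inner loop stands for the d nodes that act in parallel; hypercube_sum is a hypothetical name, not a routine of the package.

```c
/* Sum n = 2^m values held one per node. a[] is overwritten and the
   result ends up in a[0], as in the SUMMATION procedure. */
int hypercube_sum(int *a, int m)
{
    int i, j, d;

    for (i = m - 1; i >= 0; i--) {      /* i: dimension number */
        d = 1 << i;
        for (j = 0; j < d; j++) {       /* parallel on a real machine */
            int t = a[j + d];           /* receive from neighbour j+d */
            a[j] += t;
        }
    }
    return a[0];
}
```

The outer loop runs m = log n times, which is where the O(log n) bound comes from when the inner loop is executed in parallel.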

5.3 SUMMATION OF NUMBERS ON 2_d MESH STRUCTURE

An algorithm [Quinn87] to do the same task on a 2_d mesh
connected model is given below. Here $n = l^2$, where l is the number
of rows (or columns) in the model. For simplicity the n
values to be added are stored one per processing element.
The algorithm works by summing each row into column 1 and
then summing column 1. When the algorithm concludes, the element
$a_{1,1}$ contains the sum.

    ADDITION (2_d mesh)
    begin
        for i <- l-1 downto 1 do
            for all P_{j,i} where 1 <= j <= l do    /* column i active */
                t_{j,i} <= a_{j,i+1}
                a_{j,i} <- a_{j,i} + t_{j,i}
            endfor
        endfor
        for i <- l-1 downto 1 do
            for all P_{i,1} do    /* only a single processing element
                                     of column 1 is active */
                t_{i,1} <= a_{i+1,1}
                a_{i,1} <- a_{i,1} + t_{i,1}
            endfor
        endfor
    end

This algorithm has been successfully implemented and
tested using the simulator package. The routine lfork(row) is
called for row-wise process creation, and lsend_to and
lrecv_from for interprocess communication. These routines are
available in the corresponding structure file [mesh.c]. The
program implementing the summation algorithm is illustrated
in Fig 5.3.2.
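The two phases of the mesh summation can also be traced with a sequential sketch. It uses 0-based indices, so the result ends up in a[0][0] rather than in element (1,1); MESH_MAX and mesh_sum are names assumed for this illustration.

```c
#define MESH_MAX 8      /* assumed upper bound on the mesh size */

/* Sum an l x l mesh of values held one per node. a is overwritten
   and the total ends up in a[0][0]. */
int mesh_sum(int a[MESH_MAX][MESH_MAX], int l)
{
    int i, j;

    /* Phase 1: column i receives from column i+1 in every row. */
    for (i = l - 2; i >= 0; i--)
        for (j = 0; j < l; j++)
            a[j][i] += a[j][i + 1];

    /* Phase 2: in column 0, row i receives from row i+1. */
    for (i = l - 2; i >= 0; i--)
        a[i][0] += a[i + 1][0];

    return a[0][0];
}
```

Each phase takes l - 1 communication steps, so the parallel algorithm needs 2(l - 1) steps in total.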

5.4 MATRIX_BY_VECTOR MULTIPLICATION ON TREE STRUCTURE

The problem addressed in this section is that of
multiplying an m x n matrix A by an n x 1 vector U to produce
an m x 1 vector V. Matrix_by_vector multiplication requires
m+n-1 steps on a linear array. It is possible to reduce this
time to m - 1 + log n by performing the multiplication on a tree
connected network.

The algorithm [Akl89] is given as procedure TREE MV
MULTIPLICATION.

    procedure TREE MV MULTIPLICATION(A, U, V)

    do steps 1 and 2 in parallel

    (1) for i = 1 to n do in parallel
            for j = 1 to m do
                (1.1) compute u_i * a_{j,i}
                (1.2) send result to parent
            endfor
        endfor

    (2) for i = n+1 to 2n-1 do in parallel
            while P_i receives two inputs do
                (2.1) compute the sum of the two inputs
                (2.2) if i < 2n-1 then send the result to parent
                      else produce the result as output
                      endif
            endwhile
        endfor

This algorithm is illustrated in Fig 5.4.1 for n=3.

This algorithm can be easily implemented by calling the
routines available in the tree structure file [tree.c]:
lfork(level) for creating the processes level-wise, and
send_to_parent, recv_from_leftchild and recv_from_rightchild
for interprocess communication. The user program implementing
the above algorithm is illustrated in Fig 5.4.2.
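A sequential sketch of the tree computation is given below. It assumes that n is a power of two, numbers the tree in level order with the root at 0 and the leaves at n-1 to 2n-2 (the 0-based numbering used by tree.c, not the 1-based P_i of the procedure), and processes one matrix row at a time instead of pipelining rows. MV_MAX and tree_mv_multiply are hypothetical names.

```c
#define MV_MAX 8        /* assumed cap on n, the number of leaves */

/* Multiply the m x n matrix A by the vector U, writing m results to V.
   n must be a power of two and at most MV_MAX. */
void tree_mv_multiply(int m, int n, int A[][MV_MAX], const int *U, int *V)
{
    int node[2 * MV_MAX - 1];   /* value currently held at each tree node */
    int i, j;

    for (j = 0; j < m; j++) {
        /* Step 1: leaf i computes u_i * a_{j,i}. */
        for (i = 0; i < n; i++)
            node[n - 1 + i] = U[i] * A[j][i];

        /* Step 2: every internal node adds the values of its children. */
        for (i = n - 2; i >= 0; i--)
            node[i] = node[2 * i + 1] + node[2 * i + 2];

        V[j] = node[0];         /* the root produces the output */
    }
}
```

On the tree network the rows follow one another up the tree, so the first result appears after log n steps and one more result appears at each later step, which gives the m - 1 + log n bound quoted above.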


5.5 CONCLUSION 


The simulator can easily be used as a test bed to
verify parallel algorithms for different multicomputer
architectures. The user should use only the routines that are
available in the relevant predefined structure file. To run
a user program, the user should follow the instructions given in
Appendix C.
SUMMATION(numbers)
int numbers[size];
{
    int lsum, dim, d, k, d1;
    int len, from_source, node_id;

    for (d1 = 0; d1 < get_dim(); d1++)
        lfork(d1);

    node_id = get_nodeid();
    dim = get_dim() - 1;

    /* add the numbers dimension-wise */
    lsum = numbers[node_id];
    for (d = dim; d >= 0; d--)
    {
        k = power(2, d);
        if (node_id >= k)
        {
            lsend_to(node_id, d, &lsum, sizeof(int));
            terminate();
        }
        else
        {
            lrecv_from(node_id, d, &from_source, &len);
            lsum += from_source;
        }
    }

    if (node_id == 0)
        return lsum;
}

Figure 5.2.2 : User program for summation
               on hypercube





MESH_SUMMATION(numbers)
int numbers[size];
{
    int lsum, len, node_id, from_source;
    int r, k1, row, size1, column;

    for (r = 0; r < get_meshsize(); r++)    /* create the processes row wise */
        lfork(r);

    node_id = get_nodeid();
    size1 = get_meshsize();
    lsum = numbers[node_id];

    /* add the numbers column wise */
    for (column = size1-1; column > 0; column--)
    {
        if (node_id % size1 == column)
        {
            lsend_to(node_id, LEFT, &lsum, sizeof(int));
            terminate();
        }
        else if ((node_id % size1) == column-1)
        {
            lrecv_from(node_id, RIGHT, &from_source, &len);
            lsum += from_source;
        }
    }

    /* only first column is active */
    for (row = size1-1; row > 0; row--)
    {
        if ((node_id / size1 == row) && (node_id % size1 == 0))
        {
            lsend_to(node_id, ABOVE, &lsum, sizeof(int));
            terminate();
        }
        else if ((node_id / size1 == row-1) && (node_id % size1 == 0))
        {
            k1 = node_id / size1;
            lrecv_from(node_id, BELOW, &from_source, &len);
            lsum += from_source;
        }
    }

    if (node_id == 0)
        return lsum;
}

Figure 5.3.2 : User program for summation
               on 2_d mesh





matrix_by_vector(A, U, V)
int A[size][size1];
int U[size1];
{
    int l, i, n, m = 0;
    int node_id, len;
    int lsource, rsource;
    int index, product, product1;

    for (l = 1; l < get_height(); l++)
        lfork(l);

    node_id = get_nodeid();
    l = get_height();

    for (i = 0; i < n; i++)
    {
        /* Compute results and send to parent */
        if (leaf_node(node_id))
        {
            index = node_id - (power(2, l-1) - 1);
            product = U[index] * A[i][index];
            send_to_parent(node_id, &product, sizeof(int));
        }
        else
        {
            /* Receive two inputs and find the sum */
            recv_from_leftchild(node_id, &lsource, &len);
            recv_from_rightchild(node_id, &rsource, &len);
            product1 = lsource + rsource;

            if (node_id != 0)   /* If not root node send the result to parent */
                send_to_parent(node_id, &product1, sizeof(int));
            else
            {
                V[m] = product1;    /* produce the result as output */
                m++;
            }   /* else */
        }   /* else */
    }   /* for */
}

Figure 5.4.2 : Matrix_by_vector multiplication
               on tree structure

CHAPTER 6 : CONCLUSIONS


In this thesis, a simulator package is developed. It is
a platform for testing users' parallel algorithms written to
run on multicomputer architectures. To make this
package complete on its own, some features remain to be added.
A debugger to correct the user's program is dealt with in section 6.1
and a performance monitor is discussed in section 6.2. Finally,
the shortcomings of this simulator are discussed in section
6.3.

6.1 DEBUGGER 

In this simulator, we have used signals to handle
errors that occur during simulation: the process at
which an error has occurred sends a signal to the parent
process (process 0), which in turn distributes the signal among
all the processes created, terminating them with an error
condition. In addition, a debugger should be provided so that
the user is able to correct his
program with the least difficulty.

6.2 PERFORMANCE MONITOR 

In most cases, a specific multicomputer network
is more suitable than the others for a given parallel
algorithm. Hence a facility could be provided so that the
simulator evaluates the performance of the user's program
on different multicomputer networks and reports
the most efficient one, with minimum communication cost.



6.3 SHORTCOMINGS


The main shortcoming of the implementation is that the
size of the multicomputer network that can be simulated is
limited. This is because the operating system imposes an
upper limit on the total number of processes and also on the
number of processes that a single user can have running at
the same time [Sun88]. Thus the simulator cannot simulate very
large networks (bigger than 64 processors). This shortcoming
can be removed by using a user level thread package.



APPENDIX A

Routines available to write structure file

NAME

    connect - establishes the link between two processors.

SYNOPSIS

    #include "structure.h"
    void connect(a, b)
    int a, b;

DESCRIPTION

    connect establishes the unidirectional link between two
    processing elements whose node ids are 'a' and 'b'. This
    subroutine is used to set up the topology of a point to point
    connected multicomputer network.

RETURN VALUE
    None.

NAME

    connected - checks the connectivity between processing
    elements.

SYNOPSIS

    #include "structure.h"
    int connected(a, b)
    int a, b;

DESCRIPTION

    connected returns TRUE if the processing elements whose
    node ids are 'a' and 'b' are connected; otherwise it returns
    FALSE.

RETURN VALUE

    Returns 1 if connected, else returns 0.


NAME

    broad_link - establishes a broadcast link between
    two processing elements.

SYNOPSIS

    #include "structure.h"
    void broad_link(a, b, ch)
    int a, b, ch;

DESCRIPTION

    broad_link establishes the broadcast link 'ch' between the
    processing elements whose node ids are 'a' and 'b'. This
    routine is used to establish the structure of a broadcast
    communication network of multicomputers.

RETURN VALUE
    None.


NAME

    get_digit - extracts a specified digit from the address of
    a node id.

SYNOPSIS

    #include "structure.h"
    int get_digit(nodeaddr, r, d, i)
    int nodeaddr, r, d, i;

DESCRIPTION

    get_digit extracts the ith digit from the radix 'r'
    representation, with 'd' digits, of the node identifier
    nodeaddr.

RETURN VALUE

    Returns the extracted digit.

NAME

    compliment - inverts the specified digit of a node address.

SYNOPSIS

    int compliment(nodeaddr, d, i)
    int nodeaddr, i, d;

DESCRIPTION

    compliment inverts the ith digit of the node address
    nodeaddr, which has 'd' digits.

RETURN VALUE

    Returns the complemented node address.


NAME

    replace - substitutes the specified digit of a node address
    by a given digit.

SYNOPSIS

    #include "structure.h"
    int replace(nodeaddr, r, d, i, j)
    int nodeaddr, r, d, i, j;

DESCRIPTION

    replace substitutes the ith digit of the radix 'r'
    representation of the node identifier nodeaddr, which has
    'd' digits, by the digit 'j'.

RETURN VALUE

    Returns the substituted node identifier.



APPENDIX B: EXAMPLES OF STRUCTURE FILE

'hp_cube.c'

#include "structure.h"

int dim;
int proc_no;

structmain()
{
    int peid, id, neighb;

    input(dim)                  /* Input interface */

    no_procs = power(2, dim);
    initialize();

    /* Topology Setup */
    for (peid = 0; peid < no_procs; peid++)
    {
        for (id = 0; id < dim; id++)
        {
            neighb = compliment(peid, id, dim);
            connect(peid, neighb);
        }
    }

    start_sim();    /* Set up the simulation environment */
    main();         /* Entry point of user programme */
    clean();        /* Remove the communication links */
    terminate();    /* Terminate the process 0 */
}

/* Process creation */
lfork(dim1)
int dim1;
{
    int np, k, p, p1;

    np = power(2, dim1);

    if (get_nodeid() != 0)
        return;

    for (p = 0; p < np; p++)    /* Create processes in dimension dim1 */
    {
        k = p + np;
        pfork(k);
        p1 = getpid();
        if (p1 == child_pid)
        {
            proc_no = k;
            return;
        }
    }
}

/* Interprocess Communication */
lsend_to(src, dim1, mesg, len)
int dim1, src;
char *mesg;
int len;
{
    int i, dest;

    dest = compliment(src, dim1, dim);
    write_to(dest, mesg, len);
}

lrecv_from(dest, dim1, mesg, len)
int dest, dim1;
char *mesg;
int *len;
{
    int i, src;

    src = compliment(dest, dim1, dim);
    read_from(src, mesg, len);
}

/* Support Routines */
get_dim()
{
    return dim;
}

get_noprocs()
{
    return no_procs;
}

get_nodeid()
{
    return proc_no;
}



'mms.c'

#include "structure.h"

int dim, drop;
int proc_no;

structmain()
{
    int peid, d, p, neighb;
    char str[LENGTH];

    /* Input interface */
    input(drop)
    input(dim)

    no_procs = power(drop, dim);
    initialize();

    /* Topology setup */
    for (peid = 0; peid < no_procs; peid++)
    {
        for (d = 0; d < dim; d++)
        {
            for (p = 0; p < drop; p++)
            {
                if (get_digit(peid, drop, dim, d) != p)
                    connect(peid, replace(peid, drop, dim, d, p));
            }
        }
    }

    start_sim();    /* Set up the simulation environment */
    main();         /* Entry point of the user programme */
    clean();        /* Remove the communication links */
    terminate();    /* Terminate the process 0 */
}

/* Process Creation */
lfork(dim1)
int dim1;
{
    int np, np1, k, p, p1;

    np = power(drop, dim1) * (drop - 1);
    np1 = power(drop, dim1);

    if (get_nodeid() != 0)
        return;

    for (p = 0; p < np; p++)    /* Create the processes in dimension dim1 */
    {
        k = p + np1;
        pfork(k);
        p1 = getpid();
        if (p1 == child_pid)
        {
            proc_no = k;
            return;
        }
    }
}

/* Interprocess Communication */
lsend_to(src, dim1, dr, mesg, len)
int src, dim1, dr;
char *mesg;
int len;
{
    int i, dest;

    dest = replace(src, drop, dim, dim1, dr);
    write_to(dest, mesg, len);
}

lrecv_from(dest, dim1, dr, mesg, len)
int dest, dim1, dr;
char *mesg;
int *len;
{
    int i, src;

    src = replace(dest, drop, dim, dim1, dr);
    read_from(src, mesg, len);
}

/* Support Routines */
get_dim()
{
    return dim;
}

get_drop()
{
    return drop;
}

get_noprocs()
{
    return no_procs;
}

get_nodeid()
{
    return proc_no;
}



'mesh.c'

#include "structure.h"

int mesh_size, proc_no;

structmain()
{
    int peid, neighb;

    /* Input interface */
    input(mesh_size)

    no_procs = mesh_size * mesh_size;
    initialize();

    /* Topology Setup */
    for (peid = 0; peid < no_procs; peid++)
    {
        if ((neighb = get_neighb(peid, LEFT)) != -1)
            connect(peid, neighb);
        if ((neighb = get_neighb(peid, RIGHT)) != -1)
            connect(peid, neighb);
        if ((neighb = get_neighb(peid, ABOVE)) != -1)
            connect(peid, neighb);
        if ((neighb = get_neighb(peid, BELOW)) != -1)
            connect(peid, neighb);
    }

    start_sim();    /* Set up the simulation environment */
    main();         /* Entry point of the user programme */
    clean();        /* Remove the communication links */
    terminate();    /* Terminate the process 0 */
}

/* Process Creation */
lfork(row)
int row;
{
    int np, np1, p, k, p1;

    if (row == 0)
    {
        np1 = 1;
        np = mesh_size - 1;
    }
    else
    {
        np1 = row * mesh_size;
        np = mesh_size;
    }

    if (get_nodeid() != 0)
        return;

    for (p = 0; p < np; p++)    /* Create the processes in the given row */
    {
        k = p + np1;
        pfork(k);
        p1 = getpid();
        if (p1 == child_pid)
        {
            proc_no = k;
            return;
        }
    }
}

/* Interprocess Communication */
int lsend_to(src, dir, dataptr, len)
int src, dir;
char *dataptr;
int len;
{
    int dest;

    dest = get_neighb(src, dir);
    write_to(dest, dataptr, len);
}

int lrecv_from(dest, dir, dataptr, len)
int dest, dir;
char *dataptr;
int *len;
{
    int src;

    src = get_neighb(dest, dir);
    read_from(src, dataptr, len);
}

/* Support Routines */
get_nodeid()
{
    return proc_no;
}

get_noprocs()
{
    return no_procs;
}

get_mesh_size()
{
    return mesh_size;
}

int get_neighb(peid, dir)
int peid, dir;
{
    int neighb;

    switch (dir)
    {
    case LEFT:  if ((peid % mesh_size) != 0)
                    return peid - 1;
                else return -1;
                break;

    case RIGHT: if (((peid + 1) % mesh_size) != 0)
                    return peid + 1;
                else return -1;
                break;

    case ABOVE: if (peid >= mesh_size)
                    return peid - mesh_size;
                else return -1;
                break;

    case BELOW: if (peid < mesh_size * (mesh_size - 1))
                    return peid + mesh_size;
                else return -1;
                break;
    }
}



* tree . c 


^^|'include "structure . h” 

int height; 
int proc_no; 

structma in< > 

C 

int pe i d . 1 ef t_ch i Id , r i ght_chi Id, parent , leaf ; 


f* Input interface */ 

i nput (height) 

no_procs = power < 2 , he i ght ) -1; 
initialize <); 


/* Topology Setup */ 
for<peid = 0; peid < no_procs; peid++) 

C 

if<( parent = get_parent<pe id) ) != -1) 
connect (pe i d. parent ) ; 

if ( ( left__chi Id = get_l ef t_ch i ld( pe i d) ) != -1) 
connect (peid, 1 ef t_chi Id) ; 

if ( (right_chi Id = get_r ight_ch i ld(pe i d) ) != -1) 
connect (pe i d. r i ght_ch i Id) ; 

} 


start__s im( ) ; /* Set up the simulation environment */ 
main(); /* Entry point of user programme */ 

clean(); /* Remove coiwnunicat ion links */ 

terminate(); /* Terminate process 0 */ 

) 


/* Process Creation */

lfork(level1)
int level1;
{
    int np, np1, k, p, p1;

    np = power(2, level1);
    np1 = np - 1;
    if (get_nodeid() != 0)
        return;

    for (p = 0; p < np; p++)    /* Create processes at level level1 */
    {
        k = p + np1;
        pfork(k);
        p1 = getpid();
        if (p1 == child_pid)
        {
            proc_no = k;
            return;
        }
    }
}

/* Interprocess Communication */

int send_to_parent(src, dataptr, len)
int src;
char *dataptr;
int len;
{
    int dest;

    dest = get_parent(src);
    write_to(dest, dataptr, len);
}

int recv_from_parent(dest, dataptr, len)
int dest;
char *dataptr;
int *len;
{
    int src;

    src = get_parent(dest);
    read_from(src, dataptr, len);   /* receive, as in the other recv routines */
}


int send_to_leftchild(src, dataptr, len)
int src;
char *dataptr;
int len;
{
    int dest;

    dest = get_left_child(src);
    write_to(dest, dataptr, len);
}

int recv_from_leftchild(dest, dataptr, len)
int dest;
char *dataptr;
int *len;
{
    int src;

    src = get_left_child(dest);
    read_from(src, dataptr, len);
}


int send_to_rightchild(src, dataptr, len)
int src;
char *dataptr;
int len;
{
    int dest;

    dest = get_right_child(src);
    write_to(dest, dataptr, len);
}

int recv_from_rightchild(dest, dataptr, len)
int dest;
char *dataptr;
int *len;
{
    int src;

    src = get_right_child(dest);
    read_from(src, dataptr, len);
}


/* Support Routines */

get_height()
{
    return height;
}

get_noprocs()
{
    return no_procs;
}

get_nodeid()
{
    return proc_no;
}

int get_parent(peid)
int peid;
{
    int parent;

    if (peid == 0)
        return -1;
    else {
        if (peid % 2 == 0)
            parent = peid/2 - 1;
        else parent = peid/2;
        return parent;
    }
}



/* br_cube.c */

#include "structure.h"

int dim, drop;
int proc_no;

structmain()
{
    int peid, d, p, ch;
    char str[LENGTH];

    /* Input interface */
    input(dim)
    input(drop)

    no_procs = power(2, dim);
    initialize();

    /* Topology Setup */
    for (peid = 0; peid < no_procs; peid++)
    {
        for (d = 0; d < dim; d++)
        {
            for (p = 0; p < drop; p++)
            {
                ch = get_channel(peid, d);
                if (get_digit(peid, drop, dim, d) != p)
                    broad_link(peid, replace(peid, drop, dim, d, p), ch);
            }
        }
    }

    start_sim();
    main();
    clean();        /* Delete the communication links */
    terminate();
}

/* Process Creation */

lfork(dim1)
int dim1;
{
    int np, k, p, p1;

    np = power(2, dim1);
    if (get_nodeid() != 0)
        return;

    for (p = 0; p < np; p++)    /* CHANGE */
    {
        k = p + np;
        pfork(k);
        p1 = getpid();
        if (p1 == child_pid)
        {
            proc_no = k;
            return;
        }
    }
}



/* Interprocess Communication */

gsend_to(src, dim1, mesg, len)
int src, dim1;
char *mesg;
int len;
{
    int ch;

    ch = get_channel(src, dim1);
    bwrite(ch, mesg, len);
}


grecv_from(dest, dim1, mesg, len)
int dest, dim1;
char *mesg;
int *len;
{
    int ch;

    ch = get_channel(dest, dim1);
    bread(ch, mesg, len);
}


/* Support Routines */

get_dim()
{
    return dim;
}

get_noprocs()
{
    return no_procs;
}

get_nodeid()
{
    return proc_no;
}

get_channel(peid, d)
int peid, d;
{
    int ch;

    if (dim == 0)
        ch = peid/drop;
    else ch = (peid % power(drop, d)) + (drop*d);
    return ch;
}


To run a program on the simulator, the user has to do the
following:

Writing the program

The user program should call only the routines that are
available in the structure file concerned. For example, to
run a program on the hypercube structure, the user should
refer to the file hp_cube.c and may use the routines
available in that file. Please refer to APPENDIX B for the
details of the structures that are already defined. The user
should ensure that the final result of the program is always
computed at processor 0.

Preparation 

Copy the following files to your area. 

i) ~jshree/arch/proj/cc_hpcube to run the program on
hypercube

ii) ~jshree/arch/proj/cc_mesh to run the program on 2_d
mesh

iii) ~jshree/arch/proj/cc_mms to run the program on MMS

iv) ~jshree/arch/proj/cc_tree to run the program on tree

v) ~jshree/arch/proj/cc_brhpcube to run the program on
broadcast hypercube

Compiling the program 

To compile the user program, the user should use the
compiling command for the structure concerned. cc_hpcube,
cc_mms, cc_mesh and cc_tree are used to compile user
programs that simulate the hypercube, MMS, 2_d mesh and tree
structures respectively. cc_brhpcube is used to compile
programs written for the hypercube with broadcast
communication. All five commands have the same usage as the
usual C compiler cc.